-
Notifications
You must be signed in to change notification settings - Fork 48
Add shard connection backoff policy #473
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add shard connection backoff policy #473
Conversation
0b80886
to
f62dfa3
Compare
dbb3ad1
to
cbb4719
Compare
Shouldn't we have some warning / info level log when backoff is taking place? |
I would rather not do it, it is not useful and can potentially pollute the log |
Do you know what caused the test failure?
it is a unit test that at the first glance should be fully deterministic. Failure is unexpected. |
It is known issue, conversion goes wrong somewhere |
a43ccd1
to
b0fd069
Compare
f47313f
to
9dfd9ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment: integration tests for new policies are definitely needed here.
aebc540
to
61668de
Compare
The patchset lacks documentation, which would have helped to understand the feature and when/how to use it. Is documentation a separate repo / commit? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds shard‐aware reconnection policies with support for scheduling constraints. Key changes include new policy implementations and schedulers in cassandra/policies.py, modifications to connection management in cassandra/pool.py and cassandra/cluster.py, and comprehensive tests in both unit and integration suites to validate the new behavior.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
tests/unit/test_shard_aware.py | Adds tests for both immediate and delayed reconnection behavior using new policies. |
tests/unit/test_policies.py | Introduces extensive tests for scope bucket and scheduler behavior. |
tests/unit/test_host_connection_pool.py | Updates HostConnectionPool tests to integrate the new scheduler. |
tests/integration/long/test_policies.py | Validates backoff policies and correct connection formation across shards. |
tests/integration/init.py | Adds a marker for tests designed for Scylla-specific behavior. |
cassandra/pool.py | Refactors connection replacements to use the new scheduler instead of direct submission. |
cassandra/policies.py | Implements new scheduler classes and backoff policies for shard connections. |
cassandra/cluster.py | Exposes a new property and uses the scheduler for initializing shard connections. |
61668de
to
806aba9
Compare
806aba9
to
2584555
Compare
I have added documentation to all classes. |
I don't think it's such a small feature, and I think details might be missing. I did skim briefly over the code - so I might have missed it - where's the random jitter discussed, so multiple clients when do a concurrent backoff? (again - may have missed it!) |
Add abstract classes: `ShardReconnectionPolicy` and `ShardReconnectionScheduler` And implementations: `NoDelayShardReconnectionPolicy` - policy that represents old behavior of having no delay and no concurrency restriction. `NoConcurrentShardReconnectionPolicy` - policy that limits concurrent reconnections to 1 per scope and introduces delay between reconnections within the scope.
Inject shard reconnection policy into cluster, session, connection and host pool. Drop pending connections tracking logic, since policy does that. Fix some tests that mocks Cluster, session, connection or host pool.
2584555
to
8f3670e
Compare
ok, I will add it, jitter comes from |
Introduce
ShardReconnectionPolicy
and its implementations:NoDelayShardConnectionBackoffPolicy
: no delay or concurrency limit, ensures at most one pending connection per host+shard.LimitedConcurrencyShardConnectionBackoffPolicy
: limits pending concurrent connections tomax_concurrent
per scope (Cluster or Host) using a backoff policy.The idea of this PR is to shift responsibility of scheduling
HostConnection._open_connection_to_missing_shard
fromHostConnection
toShardConnectionBackoffPolicy
, that givesShardConnectionBackoffPolicy
control over process of opening connections.This feature enables finer control over process of creating per shard connections, helping to prevent connections storms.
Fixes: #483
Solutions tested and rejected
Naive delay
Description
Policy would introduce a delay instead of executing connection creation request right away.
Policy would remember last time when connection creation was scheduled to and when it tries to schedule next request it would make sure that there is time between old and new request execution is equal or more than
delay
it is configured with.Results
It worked fine when cluster operates in a normal way.
However, during testing with artificial delays, it became clear that this approach breaks down when the time to establish a
connection exceeds the configured delay.
In such cases, connections begin to pile up - the greater the connection initialization time relative to the delay, the faster they accumulate.
This becomes especially problematic during connection storms.
As the cluster becomes overloaded and connection initialization slows down, the delay-based throttling loses its effectiveness. In other words, the more the cluster suffers, the less effective the policy becomes.
Solution
The solution was to give the policy direct control over the connection initialization process.
This allows the policy to track how many connections are currently pending and apply delays after connections are created, rather than before.
That change ensures the policy remains effective even under heavy load.
This behavior is exactly what has been implemented in this PR.
Pre-review checklist
./docs/source/
.Fixes:
annotations to PR description.